Integrating transcriptomic and proteomic data for accurate assembly and annotation of genomes.

Authors

  • T S Keshava Prasad
  • Ajeet Kumar Mohanty
  • Manish Kumar
  • Sreelakshmi K Sreenivasamurthy
  • Gourav Dey
  • Raja Sekhar Nirujogi
  • Sneha M Pinto
  • Anil K Madugundu
  • Arun H Patil
  • Jayshree Advani
  • Srikanth S Manda
  • Manoj Kumar Gupta
  • Sutopa B Dwivedi
  • Dhanashree S Kelkar
  • Brantley Hall
  • Xiaofang Jiang
  • Ashley Peery
  • Pavithra Rajagopalan
  • Soujanya D Yelamanchi
  • Hitendra S Solanki
  • Remya Raja
  • Gajanan J Sathe
  • Sandip Chavan
  • Renu Verma
  • Krishna M Patel
  • Ankit P Jain
  • Nazia Syed
  • Keshava K Datta
  • Aafaque Ahmed Khan
  • Manjunath Dammalli
  • Savita Jayaram
  • Aneesha Radhakrishnan
  • Christopher J Mitchell
  • Chan-Hyun Na
  • Nirbhay Kumar
  • Photini Sinnis
  • Igor V Sharakhov
  • Charles Wang
  • Harsha Gowda
  • Zhijian Tu
  • Ashwani Kumar
  • Akhilesh Pandey
Abstract

Complementing genome sequence with deep transcriptome and proteome data could enable more accurate assembly and annotation of newly sequenced genomes. Here, we provide a proof-of-concept of an integrated approach for analysis of the genome and proteome of Anopheles stephensi, which is one of the most important vectors of the malaria parasite. To achieve broad coverage of genes, we carried out transcriptome sequencing and deep proteome profiling of multiple anatomically distinct sites. Based on transcriptomic data alone, we identified and corrected 535 events of incomplete genome assembly involving 1196 scaffolds and 868 protein-coding gene models. This proteogenomic approach enabled us to add 365 genes that were missed during genome annotation and identify 917 gene correction events through discovery of 151 novel exons, 297 protein extensions, 231 exon extensions, 192 novel protein start sites, 19 novel translational frames, 28 events of joining of exons, and 76 events of joining of adjacent genes as a single gene. Incorporation of proteomic evidence allowed us to change the designation of more than 87 predicted "noncoding RNAs" to conventional mRNAs coded by protein-coding genes. Importantly, extension of the newly corrected genome assemblies and gene models to 15 other newly assembled Anopheline genomes led to the discovery of a large number of apparent discrepancies in assembly and annotation of these genomes. Our data provide a framework for how future genome sequencing efforts should incorporate transcriptomic and proteomic analysis in combination with simultaneous manual curation to achieve near complete assembly and accurate annotation of genomes.


Similar resources

Pinstripe: a suite of programs for integrating transcriptomic and proteomic datasets identifies novel proteins and improves differentiation of protein-coding and non-coding genes

MOTIVATION Comparing transcriptomic data with proteomic data to identify protein-coding sequences is a long-standing challenge in molecular biology, one that is exacerbated by the increasing size of high-throughput datasets. To address this challenge, and thereby to improve the quality of genome annotation and understanding of genome biology, we have developed an integrated suite of programs, c...


Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onio...


Comprehensive Annotation of the Parastagonospora nodorum Reference Genome Using Next-Generation Genomics, Transcriptomics and Proteogenomics

Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we ...


KEGGexpressionMapper allows for analysis of pathways over multiple conditions by integrating transcriptomics and proteomics measurements

Motivation: In transcriptomic and proteomics-based studies, the abundance of genes is often compared to functional pathways such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) to identify active metabolic processes. Even though a plethora of tools allow users to analyze and compare omics data with respect to KEGG pathways, the analysis of multiple conditions is often limited to only a define...


Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. The annotation process is now performed almost exclusively in an automated fashion to balance the large number of sequences generated. One possible way of reducing errors inherent to automated computational annotations ...



Journal:
  • Genome Research

Volume 27, Issue 1

Pages: -

Publication date: 2017